ReHoGCNES-MDA: prediction of miRNA-disease associations using homogenous graph convolutional networks based on regular graph with random edge sampler

Abstract Numerous investigations increasingly indicate the significance of microRNA (miRNA) in human diseases. Hence, unearthing associations between miRNAs and diseases can contribute to precise diagnosis and efficacious remediation of medical conditions. The detection of miRNA-disease linkages via computational techniques utilizing biological information has emerged as a cost-effective and highly efficient approach. Here, we introduce a computational framework named ReHoGCNES, designed for prospective miRNA-disease association prediction (ReHoGCNES-MDA). This method constructs a homogenous graph convolutional network with regular graph structure (ReHoGCN) encompassing the disease similarity network, the miRNA similarity network and the known MDA network, and was tested on four experimental tasks. A random edge sampler strategy was utilized to expedite processes and diminish training complexity. Experimental results demonstrate that the proposed ReHoGCNES-MDA method outperforms both the homogenous graph convolutional network and the heterogeneous graph convolutional network with non-regular graph structure in all four tasks, which implicitly reveals that a steady degree distribution of a graph plays an important role in enhancing model performance. Besides, ReHoGCNES-MDA is superior to several machine learning algorithms and state-of-the-art methods on MDA prediction. Furthermore, three case studies were conducted to further demonstrate the predictive ability of ReHoGCNES. Consequently, 93.3% (breast neoplasms), 90% (prostate neoplasms) and 93.3% (pancreatic neoplasms) of the top 30 forecasted miRNAs were validated by public databases. Hence, ReHoGCNES-MDA might serve as a dependable and beneficial model for predicting possible MDAs.


INTRODUCTION
MicroRNA (miRNA), a category of small noncoding RNA comprising ∼22 nucleotides, was initially unearthed in 1993 [1,2]. It significantly influences the modulation of target gene expression through inducing cleavage degradation, halting translation and other structural regulatory mechanisms [3]. A growing body of research highlights the involvement of miRNAs in crucial biological functions including cell growth and diversification, senescence-induced apoptosis, immune reactions, signaling pathways, tumor penetration and viral incursions [4][5][6][7][8]. For example, a simultaneous decline in mir-103 or mir-107 levels alongside an elevation in silk protein levels was observed in a transgenic mouse model replicating Alzheimer's disease [9]. Gao et al. [10] identified the dysregulation of mir-145 and mir-199 expression in the early stages of hepatitis B virus-associated multistep liver cancer. A recent study has shown that mir-23, mir-24 and mir-27 harbor potential therapeutic elements for tackling ischemic cardiac and vascular disorders [11].
Hence, unveiling potential miRNA-disease connections could furnish a more profound comprehension of the molecular underpinnings of diseases, enabling early intervention. Efficient harnessing of information regarding miRNA-disease relations can enhance the diagnosis, prognosis and remediation of intricate human diseases [12,13]. As of now, the majority of the miRNA-disease associations (MDAs) are sourced from biological experiments, which are both time-intensive and costly [14]. Instead, computational methods are developed to predict potential MDAs and help researchers prioritize them for experimental validation [15,16]. Generally, these computational methods are categorized into four main types: similarity-based methods [17][18][19][20][21][22][23], traditional machine learning [24][25][26][27][28], deep learning [29][30][31] and graph neural network (GNN) algorithms [32][33][34]. Table S1 summarizes the four types of computational methods for MDA prediction. For example, DNRLMF-MDA, proposed by Yan et al. [23], computes the probability that a miRNA interacts with a disease by a logistic matrix factorization method, where latent vectors of miRNAs and diseases represent the properties of miRNAs and diseases, respectively, and further improves prediction performance via dynamic neighborhood regularization. Xuan et al. [24] developed a model, termed HDMP, grounded on similarity methods. The functional resemblance among miRNAs was deduced based on disease terminologies and the similarity of disease phenotypes. Nonetheless, HDMP falls short in predicting prospective miRNAs for novel diseases devoid of any known associated miRNAs. A variety of traditional machine learning- and deep learning-based frameworks have been devised to predict possible MDAs. Chen et al. [35] introduced RKNNMDA, which employs SVMs to rank K-nearest neighbors (KNN) and uses weighted voting to predict MDAs. Zhao et al. [27] employed adaptive boosting for MDA prediction, utilizing a decision tree as the basic classifier before amalgamating weak classifiers into a robust classifier based on their respective weights. These algorithms were leveraged to derive effective feature representations and tackle the specific optimization challenge to predict reliable MDAs. Deep learning approaches have surpassed traditional machine learning methods in terms of accuracy [36,37]. A common suggestion to augment accuracy via deep learning is to employ more data for training, a strategy that typically does not improve the accuracy of classical machine learning algorithms, which requires researchers to seek more refined methods for accuracy enhancement. Wang et al. [29] presented a novel data-driven end-to-end learning-based method of neural multiple-category miRNA-disease association prediction (NMCMDA) for predicting multiple-category MDAs. As deep learning is a black box model, hyperparameter selection and network design are challenging [38,39]. Moreover, deep learning methods require significant execution time owing to substantial compute and memory operations [40].
GNNs, especially graph convolutional networks (GCNs), are widely used in bioinformatics applications due to their robust data comprehension and cognitive capabilities. They are employed in various tasks such as drug-target interaction prediction [41][42][43][44][45], gene-disease association identification [46][47][48], drug-drug interaction prediction [49][50][51][52] and so on. For example, the GraRep technique, as introduced by [55], integrates principles from similarity-based, machine learning and GNN methodologies. It builds a heterogeneous GCN encompassing miRNA, disease, drug, protein and lncRNA, along with their interactions. Additionally, disease similarity data are taken into account when forming embedding representations. In the final stage, the random forest (RF) algorithm is employed to predict potential MDAs. GNNs have strong data and knowledge representation capabilities: they can express not only the independent characteristics of samples (nodes), but also the connections (links) between samples of the same type or even of different types. Unlike images and sequences, which are data in Euclidean space with a fixed structure, GNNs deal with data such as biological data in non-Euclidean spaces with extremely flexible graph structures. Several studies have endeavored to employ varied GCN architectures for predicting MDAs, and these can be broadly categorized into three classes: (i) pairwise GCNs [53][54][55], which deploy two separate GCNs to derive embeddings for miRNAs and diseases, thereafter predicting MDAs. However, this graph structure overlooks the relationships between miRNA-disease pairs (MDPs). (ii) Link prediction on a bipartite graph [56], where both miRNAs and diseases are regarded as nodes while MDAs are regarded as edges. In this setup, a large number of negative samples are used as edges during node updates, leading to the over-smoothing issue due to the inclusion of numerous false neighbors. (iii) Node prediction on a fully connected graph [57][58][59], where the graph's high density causes the embeddings of each node to converge toward uniformity as nodes continue to update, thereby also triggering the over-smoothing problem.
A homogenous graph refers to a graph where all nodes and edges belong to the same type, while in a heterogeneous graph, nodes and edges may have multiple types, representing different entities or relationships. The classification of graphs into homogenous and heterogeneous is based on the properties of nodes and edges. Figure 1 illustrates the basic structures of the homogenous graph (A) and the heterogeneous graph (B) used in the MDA task. The path-based miRNA-disease association (PBMDA) prediction model proposed by You [85] constructed a heterogeneous graph consisting of three interlinked sub-graphs and adopted a depth-first search algorithm to infer potential MDAs. Chen [17] designed a Laplacian score of homogenous graphs to calculate the global similarity of networks and proposed a global similarity method based on a two-tier random walk to reveal the correlation between miRNAs and diseases.
Based on the distribution of node degrees, graphs can be classified into regular and non-regular graphs. In a regular graph, each node has the same degree, meaning that every node has the same number of neighbors, while nodes in a non-regular graph may have different degrees and therefore different numbers of neighbors. Table 1 summarizes different graph structures and prediction models for MDA prediction. The regular undirected graph, a simple type of graph structure that is nevertheless hard to generate, has both good combinatorial properties and strong algebraic constraints, but it has not been widely used in bioinformatics [68,69]. A fully connected network based on similarity is a common way to construct regular graphs. NIMCGCN [54] first learns miRNA and disease latent feature representations from fully connected homogenous miRNA and disease similarity graphs, respectively. The learned features are then input into a novel neural inductive matrix completion model to generate a completed association matrix. Another common way to construct regular graphs is the k-NN method. ProteinGCN [76] represents a protein as a graph in which atoms are represented by nodes, and edges connect each atom node to its k nearest neighbors. This representation has rotation invariance and reduces the information redundancy brought about by fully connected networks. Chu et al. [32] designed a homogenous graph of MDPs in which an edge is constructed between a node and its k nearest neighbors based on the node information. However, the aforementioned methods only accomplish a portion of the MDA predictions, neglecting the predictive task for new diseases, new miRNAs and their associations. Hence, it is worthwhile to investigate new GCN architectures to fully harness the structural attributes of the biological graph. Additionally, validation or independent testing should be conducted to evaluate the overall predictive efficacy on new miRNAs and diseases that were not included in the training dataset.

Despite numerous efforts [60][61][62] to minimize training expenditures with GCNs, these approaches still encounter hurdles in terms of accuracy, scalability and training complexity. The 'neighbor explosion' phenomenon is a common hurdle when dealing with large, complex graphs: the cost of node representation and stochastic gradient computation increases exponentially with the number of message passing layers. Additionally, stacking multiple GCN layers often results in over-smoothing or overfitting, causing nodes to have similar representations after aggregation operations as the neural network goes deeper. Researchers have proposed various graph sampling techniques to reduce the number of nodes involved in message passing, thereby lowering training costs. The most common techniques include node sampling (such as GraphSAGE [63], PinSage [64], VRGCN [65]), layer sampling (such as FastGCN [66], ASGCN [67]) and edge sampling.
In light of these challenges, this study introduces a novel ReHoGCNES-MDA method, grounded on a regular undirected graph, for MDA prediction, and evaluates it across four distinct prediction tasks, incorporating both aforementioned viewpoints simultaneously. Initially, we proposed a homogenous GCN with a regular graph structure (ReHoGCN) utilizing the k-nearest neighbors (k-NN) algorithm [70] to efficiently and adequately probe information such as node features (i.e. miRNAs and diseases) and network topology (i.e. miRNA-disease links). For comparison, we also introduced a homogenous GCN with a non-regular graph structure (UReHoGCN) built via the k-means algorithm [71], and a heterogeneous GCN (HeGCN). A random edge sampler (ES) strategy was utilized to hasten processes and diminish training complexity. Subsequently, these three structurally distinct GCNs were tested on four experimental tasks concerning MDA predictions for the first time: specifically, predicting new associations between

Datasets
In our study, experimentally verified known MDAs were obtained from the Human-miRNA Disease Database (HMDD) v2.0 [72], released in 2014, encompassing 5430 associations between 495 miRNAs and 383 diseases. The subsequent release of HMDD v3.0 [73] in 2019 incorporated newly discovered MDAs, while the MDAs from HMDD v2.0 were designated as old. Known MDAs were regarded as positive samples, while the remaining pairs were deemed indeterminate and categorized as unlabeled data, from which a number of pairs equal to the number of positive pairs were randomly chosen to form a negative dataset. Utilizing this approach, we devised five datasets for the first time, as detailed in Table 2 and Table S2: (i) the training set, comprising all old diseases, old miRNAs and their established association pairs; (ii) the Tp test set, inclusive of all old diseases, old miRNAs and their newly identified association pairs; (iii) the Td test set, containing all new diseases, old miRNAs and their association pairs; (iv) the Tm test set, with all old diseases, new miRNAs and their association pairs; (v) the Tn test set, encompassing all new diseases, new miRNAs and their association pairs.
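The balanced sampling of negatives described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_balanced_pairs`, the fixed seed and the toy identifiers are assumptions made for the example.

```python
import random


def build_balanced_pairs(known_pairs, mirnas, diseases, seed=0):
    """Return (positives, negatives) with as many negatives as positives.

    known_pairs: iterable of (mirna, disease) tuples treated as positives.
    Negatives are drawn uniformly at random from the unlabeled pairs.
    """
    positives = sorted(set(known_pairs))
    positive_set = set(positives)
    # Every miRNA-disease pair without a known association is unlabeled.
    unlabeled = [(m, d) for m in mirnas for d in diseases
                 if (m, d) not in positive_set]
    if len(unlabeled) < len(positives):
        raise ValueError("not enough unlabeled pairs to balance the positives")
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    negatives = rng.sample(unlabeled, len(positives))
    return positives, negatives


# Toy example with hypothetical identifiers.
mirnas = ["miR-a", "miR-b", "miR-c"]
diseases = ["d1", "d2"]
known = [("miR-a", "d1"), ("miR-b", "d2")]
pos, neg = build_balanced_pairs(known, mirnas, diseases)
```

Because negatives come from unlabeled pairs, some of them may be true but undiscovered associations, which is the false-negative risk the paper discusses for the Tn task.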

Node feature representations
Integrated features based on disease semantic similarity, miRNA functional similarity and Gaussian interaction profile (GIP) kernel similarities were adopted in this study.

Disease semantic similarity matrix
Utilizing the Medical Subject Headings (MeSH) descriptors [24], accessible at https://www.ncbi.nlm.nih.gov/, we calculated the Disease Semantic Similarity Matrix (DSSM). The Directed Acyclic Graph (DAG), illustrating the relationships among various diseases, has been extensively utilized in numerous studies [56,74] to construct the DSSM. Two distinct DSSMs were defined based on two different rationales. DSSM1 is formulated under the premise that two diseases sharing a larger portion of their DAGs are more similar. DSSM2, on the other hand, posits that a disease occurring in more (or fewer) DAGs may be more common (or specific).
To derive a more rational DSSM, element-wise averaging was conducted on the aforementioned DSSMs to amalgamate them into the final DSSM.

MiRNA functional similarity matrix
The construction of the miRNA Functional Similarity Matrix (MFSM) is predicated on the assumption that miRNAs exhibiting similar functions are more prone to association with diseases showcasing similar phenotypes, and the converse holds true as well [74]. The gene-gene interaction data were procured from HumanNet [75], where each edge is ascribed a weight determined by a corresponding log-likelihood score. Initially, we applied Min-Max normalization to the scores. Subsequently, the functional similarity between any pair of genes is calculated as follows:
$$S(g_i, g_j) = \begin{cases} 1, & i = j \\ w(g_i, g_j), & i \neq j \text{ and } e(g_i, g_j) \in E \\ 0, & \text{otherwise} \end{cases}$$
where $w(g_i, g_j)$ represents the normalized value transformed by Min-Max normalization, $E$ denotes the edge set of the gene interaction network and $e(g_i, g_j)$ denotes the edge between $g_i$ and $g_j$.
For miRNAs $m_i$ and $m_j$, $G_i$ represents the set of genes associated with miRNA $m_i$ and $G_j$ represents the set of genes associated with miRNA $m_j$, where $G_i$ contains $|G_i|$ genes and $G_j$ contains $|G_j|$ genes. Then, the functional similarity of miRNAs $m_i$ and $m_j$ is calculated as follows:
$$\mathrm{MFSM}(m_i, m_j) = \frac{\sum_{g \in G_i} S(g, G_j) + \sum_{g \in G_j} S(g, G_i)}{|G_i| + |G_j|}$$
where $|G|$ is the cardinality of gene set $G$, and $S(g, G) = \max_{g_t \in G} S(g, g_t)$. The MFSM is accessible at https://www.cuilab.cn/files/images/cuilab/misim.zip.
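A small worked sketch of the miRNA functional similarity may help. It assumes the usual best-match-average reading of the definitions above (each gene is matched to its most similar gene in the other set, and the matches are averaged over $|G_i| + |G_j|$); the toy gene names, weights and function names are hypothetical.

```python
def gene_set_similarity(gene, gene_set, gene_sim):
    """S(g, G): highest similarity between gene g and any gene in G."""
    return max(gene_sim(gene, other) for other in gene_set)


def mirna_functional_similarity(genes_i, genes_j, gene_sim):
    """Best-match average of gene similarities over two gene sets."""
    if not genes_i or not genes_j:
        return 0.0  # assumption: no associated genes means no evidence
    total = sum(gene_set_similarity(g, genes_j, gene_sim) for g in genes_i)
    total += sum(gene_set_similarity(g, genes_i, gene_sim) for g in genes_j)
    return total / (len(genes_i) + len(genes_j))


# Toy Min-Max-normalized edge weights of a gene interaction network.
weights = {frozenset(("g1", "g2")): 0.8, frozenset(("g2", "g3")): 0.4}


def gene_sim(a, b):
    """1 for identical genes, the edge weight if an edge exists, else 0."""
    if a == b:
        return 1.0
    return weights.get(frozenset((a, b)), 0.0)


score = mirna_functional_similarity({"g1", "g2"}, {"g2", "g3"}, gene_sim)
```

In this toy case the four best matches are 0.8, 1, 1 and 0.4, so the score is 3.2 / 4 = 0.8.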

GIP kernel similarity
The GIP kernel similarity for miRNAs (diseases) was constructed based on the assumption that functionally (phenotypically) similar miRNAs (diseases) exhibit similar interaction patterns with diseases (miRNAs) [76]. Take the construction of the disease GIP kernel similarity matrix (DGSM) as an example. First, for a given disease $d_i$, we used the vector $P_{d_i}$ to denote the interaction profile of this disease, which is a binary vector and corresponds to the ith row of the miRNA-disease adjacency matrix. The values ('1' or '0') in $P_{d_i}$ indicate whether the disease has a known association with each miRNA or not. Then, we can obtain DGSM by calculating the similarity between each disease pair:
$$\mathrm{DGSM}(d_i, d_j) = \exp\left(-\gamma_d \lVert P_{d_i} - P_{d_j} \rVert^2\right)$$
$$\gamma_d = \gamma'_d \Big/ \left(\frac{1}{n} \sum_{i=1}^{n} \lVert P_{d_i} \rVert^2\right)$$
where $n$ represents the number of rows of the MDA matrix $A$, that is, the number of all miRNAs. $\gamma'_d$ is an initial bandwidth parameter that can be determined by further cross-validation. According to previous research [76,86], almost all researchers have simply set it to 1. The parameter $\gamma_d$ was then employed to regulate the kernel bandwidth. The GIP kernel similarity matrix for miRNAs (MGSM) can be calculated in a similar manner.
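The GIP kernel computation can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: the bandwidth is the initial value (set to 1) divided by the mean squared norm of the interaction profiles, as is common for GIP kernels, and the name `gip_kernel` and the toy matrix are hypothetical.

```python
import math


def gip_kernel(profiles, gamma_init=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    profiles: list of binary interaction profiles, one row per entity.
    gamma_init: initial bandwidth (set to 1, following common practice).
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(x * x for x in row) for row in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_init / mean_sq_norm  # normalized bandwidth
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist_sq = sum((a - b) ** 2
                          for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * dist_sq)
    return kernel


# Toy association matrix: one row per disease, one column per miRNA.
A = [[1, 0, 1],
     [1, 0, 0],
     [0, 1, 0]]
K = gip_kernel(A)
```

Diseases 0 and 1 share one associated miRNA, so `K[0][1]` is larger than `K[0][2]`, where the profiles do not overlap. Passing the transposed matrix would give the miRNA kernel in the same way.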

Integrating similarity for miRNAs and diseases
Given the prevalence of sparse values in the earlier obtained MFSM and DSSM, we combined them with the GIP kernel similarity matrices MGSM and DGSM, respectively, to fill the zero-value entries. Consequently, we acquired the Integrated miRNA Similarity Matrix (IMSM) and the Integrated Disease Similarity Matrix (IDSM). Utilizing IMSM as an illustration, the integration follows the equations of [77].
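One common way to realize this integration is sketched below. It is an assumption made for illustration rather than the exact rule of [77]: the functional (or semantic) similarity is kept wherever it is nonzero, and the GIP kernel similarity fills the zero entries. The function name and toy matrices are hypothetical.

```python
def integrate_similarity(primary, gip):
    """Keep nonzero primary similarities; fill zero entries from the GIP kernel."""
    return [[p if p != 0 else g for p, g in zip(p_row, g_row)]
            for p_row, g_row in zip(primary, gip)]


# Toy 2x2 example: MFSM has a missing (zero) off-diagonal similarity.
mfsm = [[1.0, 0.0],
        [0.0, 1.0]]
mgsm = [[1.0, 0.3],
        [0.3, 1.0]]
imsm = integrate_similarity(mfsm, mgsm)
```

The same function applied to DSSM and DGSM would give the IDSM under the same assumption.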

ReHoGCNES model for predictions of MDAs
There are two crucial steps in building the proposed ReHoGCN model: (i) construction of a regular graph by the k-NN algorithm and (ii) prediction of MDAs using a novel GCN model via edge-based graph sampling.

Construct homogenous GCN with regular graph structure
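Consider a one-layer GCN:
$$H^{l+1} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{l} W^{l}\right) \quad (6)$$
Here, $A$ denotes the adjacency matrix of graph $G$. Generally, $\tilde{A} = A + I_N$, where $I_N$ is the identity matrix, and $\tilde{D}$ is the degree matrix of $\tilde{A}$. $\sigma(\cdot)$ represents an activation function. The weight matrix $W^{l}$ is trainable and specific to each layer. $H^{l} \in \mathbb{R}^{N \times D}$ is the matrix of activations in the lth layer.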
Equation (6) can be divided into two steps: (i) first, a new feature matrix $Y$ is obtained by graph convolution; (ii) then a fully connected layer is added on $Y$, $H^{l+1} = YW^{l}$. The symmetric normalized Laplacian matrix is $L^{sym} = I_N - D^{-1/2} A D^{-1/2}$. GCN can be regarded as a special form of Laplacian smoothing. The Laplacian quadratic form represents the smoothness of the graph network structure. From the point of view of information aggregation, the normalized Laplacian matrix performs a weighted aggregation of the first-order neighbor information of a node, with a weight inversely proportional to the degree of the node. A regular graph fits the idea of Laplacian smoothing well, which is to make a point as similar as possible to the points around it: the new feature of each node is the mean of the features of the nodes around it. This property of a regular graph lets each node make better use of the information of its surrounding nodes.
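The two-step view of a GCN layer, and the mean-aggregation behaviour on a regular graph, can be checked with a small pure-Python sketch. It assumes self-loops are added before normalization and uses ReLU as the activation; the helper names and the 4-cycle example are illustrative only.

```python
import math


def matmul(a, b):
    """Plain dense matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def normalized_adjacency(adj):
    """Symmetrically normalize the adjacency matrix after adding self-loops."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj, h, w):
    """One GCN layer: graph convolution, dense layer, then ReLU (assumed)."""
    y = matmul(normalized_adjacency(adj), h)  # step (i): graph convolution
    z = matmul(y, w)                          # step (ii): fully connected layer
    return [[max(0.0, v) for v in row] for row in z]


# 4-cycle: a 2-regular graph, so every node has degree 3 with its self-loop.
cycle = [[0, 1, 0, 1],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [1, 0, 1, 0]]
features = [[1.0], [2.0], [3.0], [4.0]]
out = gcn_layer(cycle, features, [[1.0]])
```

With an identity weight, node 1 receives (1 + 2 + 3) / 3 = 2 and node 0 receives (1 + 2 + 4) / 3, that is, the plain mean over each node and its neighbors, which is the smoothing behaviour described above for regular graphs.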
The heterogeneity of biological data often poses challenges in obtaining regular graphs. To thoroughly encapsulate the structural information within the feature space, we construct the adjacency matrix using the k-NN algorithm, focusing on the node features and the topological relationships between miRNAs and diseases. This adjacency matrix represents the similarity or distance relationships between data points, with the k-NN method retaining the k nearest neighbors for each data point. The specific steps are as follows: (i) compute the distance or similarity between data points to generate a distance matrix or similarity matrix. (ii) For each data point, select the k nearest points from its neighbors and set the corresponding row in the matrix to represent the neighbors of that data point. (iii) Set non-symmetric elements in the adjacency matrix to 0 and symmetric elements to the average of their distances or similarities. The k-NN algorithm is an efficient way to generate regular graphs because the degree matrix of the graph obtained by the k-NN algorithm is $\mathrm{diag}(k, \ldots, k)$. Given that the k-NN algorithm is applied across all data, we assign a label of 0 to nodes within the test set to ensure that no test data leak during the training phase. This graph construction approach generates a regular graph in a simple yet effective manner, enabling the optimal utilization of graph information.
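The three construction steps can be sketched as follows. Cosine similarity is an assumption chosen for the example (the text allows any distance or similarity), and the function names and toy features are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity between two feature vectors (0 if either is zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def knn_adjacency(features, k):
    """Build a symmetric k-NN adjacency matrix from node features."""
    n = len(features)
    # Step (i): pairwise similarity matrix.
    sim = [[cosine_similarity(features[i], features[j]) for j in range(n)]
           for i in range(n)]
    # Step (ii): keep the k most similar other nodes for each node.
    neighbors = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim[i][j], reverse=True)
        neighbors.append(set(others[:k]))
    # Step (iii): drop non-symmetric links, average the symmetric ones.
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            if i in neighbors[j]:
                adj[i][j] = (sim[i][j] + sim[j][i]) / 2.0
    return adj


# Toy node features with two obvious pairs of similar nodes.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
adj = knn_adjacency(features, k=1)
degrees = [sum(1 for v in row if v > 0) for row in adj]
```

In this toy example every node ends up with exactly one neighbor. In this sketch, step (iii) can leave a node with fewer than k links when a neighbor choice is not mutual, so the degree is bounded by k rather than forced to equal it.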
For comparison, we also proposed a homogenous GCN with a non-regular graph structure (UReHoGCN) utilizing the k-means algorithm, and a HeGCN. UReHoGCN replaces the k-NN algorithm with the k-means algorithm to generate different numbers of neighbors for each node. HeGCN consists of three interlinked sub-graphs whose adjacency matrices are the disease-disease fully connected matrix, the miRNA-miRNA fully connected matrix and the miRNA-disease link matrix.
After parameter-tuning (Supplementary Figures S1 and S2), the best k parameter is 5 for ReHoGCN model and the best n parameter is 300 for UReHoGCN model.Detailed degree statistics and degree distribution of the three proposed models are shown in Table 3 and Figure S3, respectively.

ReHoGCN model with random ES
We transformed the MDA prediction task from link prediction to node prediction, which came with increased training complexity due to the growth in the number of nodes and edges within the graph. To address this issue, we employed a random ES strategy known for its scalability, efficiency and low training complexity, which also aids in averting the over-smoothing problem. A crucial consideration when defining an ES is to ensure that edges with non-negligible probability are sampled. Concurrently, normalization was carried out to reduce the variance in aggregated node information and mini-batch loss in full GCNs. The probability of an edge $(u, v)$ being sampled in a subgraph was calculated as follows:
$$p_{u,v} = \frac{\frac{1}{\deg(u)} + \frac{1}{\deg(v)}}{\sum_{(u',v') \in E} \left(\frac{1}{\deg(u')} + \frac{1}{\deg(v')}\right)} \quad (8)$$
where $\deg(u)$ and $\deg(v)$ are the degrees of nodes $u$ and $v$, respectively, $E$ is the set of edges and $(u', v')$ is any edge in $E$. $M$ edges are randomly sampled (with replacement) from $E$ according to $p$. From Equation (8), each edge has a non-negligible probability of being sampled. However, this sampler, which preserves the connectivity characteristics of graph $G$, will almost inevitably introduce bias into the minibatch estimation. Here, we present normalization techniques to eliminate biases.
In these normalizations, $C_v$ denotes the frequency of node $v$'s appearance within the $N$ subgraphs and $C_{u,v}$ denotes the frequency of edge $(u, v)$'s appearance within the $N$ subgraphs. For a regular graph, $p^{(l)}_e = p$. For a previously sampled $u^{(l)}$ to establish a connection to layer $l + 1$, at least one of its edges has to be selected by the layer $l + 1$ sampler. A node in the input layer must therefore survive a number of independent sampling processes, so such a layer sampler may yield an overly sparse minibatch for $L > 1$. Moreover, the connectivity within a minibatch remains unaffected by the depth of the GCN, implying that if an edge is present in layer $l$, it persists through all layers. During propagation within the GCN layers, precise node embeddings can be derived within subgraphs, and sampled nodes can mutually reinforce each other without requiring information from outside the batch. This methodology naturally alleviates the neighbor explosion dilemma typically faced by GCN algorithms. Additionally, the preprocessing overhead is minimal as all subgraphs can be utilized as minibatches throughout the training phase. Table 4 lists the experimental environment, including the hardware, software, programming languages and libraries used in our work. The data and source code are available from https://github.com/yufangzsjtu/ReHoGCNES-MDA to use this architecture and reproduce the results.
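The random edge sampler described in this subsection can be sketched as follows. The sketch assumes the degree-based probability of a GraphSAINT-style edge sampler (proportional to 1/deg(u) + 1/deg(v)) and simply counts how often nodes and edges appear across subgraphs; it is not the released implementation, and all function names are hypothetical.

```python
import random
from collections import Counter


def edge_probabilities(edges):
    """Sampling probability of each edge, proportional to 1/deg(u) + 1/deg(v)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    weights = [1.0 / degree[u] + 1.0 / degree[v] for u, v in edges]
    total = sum(weights)
    return [w / total for w in weights]


def sample_subgraph(edges, probs, m, rng):
    """Draw m edges with replacement; return the nodes and distinct edges hit."""
    sampled = set(rng.choices(edges, weights=probs, k=m))
    nodes = {x for edge in sampled for x in edge}
    return nodes, sampled


def appearance_counts(edges, probs, m, n_subgraphs, seed=0):
    """Count node and edge appearances (C_v and C_{u,v}) over N subgraphs."""
    rng = random.Random(seed)
    node_count, edge_count = Counter(), Counter()
    for _ in range(n_subgraphs):
        nodes, sampled = sample_subgraph(edges, probs, m, rng)
        node_count.update(nodes)
        edge_count.update(sampled)
    return node_count, edge_count


# 4-cycle: a regular graph, so every edge gets the same probability.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
probs = edge_probabilities(edges)
node_count, edge_count = appearance_counts(edges, probs, m=2, n_subgraphs=50)
```

On the regular 4-cycle every edge has probability 0.25, in line with the remark that the edge probability is constant for a regular graph. The counts can then feed whichever normalization of aggregation and loss is used; that step is left out here because its exact form is not given in this section.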

Overall approach and experimental execution
All experiments were conducted 10 times to ascertain average scores of the prediction outcomes.The evaluation metrics for the model encompass accuracy, precision, recall, F1-score and the area under the Receiver Operating Characteristic curve (AUC).
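For reference, the listed metrics can be computed as in the sketch below. The 0.5 decision threshold and the pairwise (rank-based) AUC computation are assumptions made for illustration; the paper does not state how its metrics were implemented.

```python
def classification_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, precision, recall, F1 and AUC for binary labels and scores."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # AUC as the chance that a random positive outscores a random negative
    # (ties count as one half).
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if pos and neg:
        wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
                   for p in pos for q in neg)
        auc = wins / (len(pos) * len(neg))
    else:
        auc = float("nan")  # AUC is undefined without both classes
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1, "auc": auc}


# Toy scores for two positive and two negative miRNA-disease pairs.
metrics = classification_metrics([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

Averaging such dictionaries over the 10 repeated runs would give the mean scores reported in the tables.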

Prediction performance of three graph architecture
The evaluation metrics of ReHoGCN and comparisons with the other proposed models, UReHoGCN and HeGCN, are shown in Table 5. Table 5 illustrates that ReHoGCN, with its regular graph structure, achieved a better AUC than the non-regular UReHoGCN and HeGCN on all four MDA prediction tasks. Besides, UReHoGCN achieved the second best AUC on MDA predictions. This shows that taking associations among MDPs into consideration is an easy and effective way to fuse heterogeneous information. The difficulty for a heterogeneous graph network like HeGCN is that it must attend simultaneously to the topological structure of the graph, such as the different types of nodes and edges, and to the attributes of each node. Homogenous GCNs with regular graphs have achieved certain advantages in MDA prediction problems and can also provide useful insights for other tasks. P-values were obtained by conducting a paired t-test between the AUCs of the three proposed models on the four tasks, and are presented in Table 6. The very small P-values indicate the statistical significance of the improvements.
Across tasks, all three GCN models achieved their best performance on the Tp task, obtaining the highest AUC, accuracy, precision, recall and F1-score, especially ReHoGCN. The test set Tp contains the associations of old miRNAs and old diseases, that is, the miRNAs and diseases in test set Tp are all included in the training phase, while the other test sets involve new miRNAs and/or new diseases. Compared with task Tp, the accuracy of the model on task Tn is reduced. This is because our negative samples are randomly selected from the unknown MDAs, among which there are many potential MDAs. Due to the reduction of samples, the false negative problem in the sample is very prominent. In addition, the distribution of the test sets is very uneven and deviates from that of the training set, which also affects the accuracy of the model.

Prediction performance of ReHoGCN compared with UReHoGCN and HeGCN with/without random ES
Although ReHoGCN achieved better prediction performance than HeGCN, it increases the number of nodes and their degrees, leading to enormous computational complexity and memory access cost (MAC). ES helps to solve these problems (Supplementary Table S4). Figures 3 and 4 show that random ES can significantly reduce the training time (numerical value and P-value, Supplementary Tables S5 and S8) and MAC (numerical value and P-value, Supplementary Tables S6 and S8) with no loss of prediction accuracy (numerical value and P-value, Supplementary Tables S4 and S7) on the ReHoGCN and UReHoGCN models. ES defines 'influence' from the graph connectivity perspective and considers joint information from node connections as well as node attributes. Only nodes with high influence are sampled by this sampler instead of aggregating information on the full graph, which avoids the 'neighbor explosion' phenomenon and the over-smoothing problem. This ES has been proven to be unbiased and to have minimal variance, which contributes to its satisfying model performance. However, on the HeGCN model, we did not observe the same results. One possible reason is that, with a small number of nodes and links in the graph, the operation time and MAC spent on sampling probability calculation and subgraph generation are not negligible relative to the full-graph calculation (detailed analysis, Supplementary Table S9). Therefore, the advantages of subgraph sampling are not shown. For subsequent comparison with other methods, we select ReHoGCNES as the best-performing model. In conclusion, ReHoGCNES has three advantages: (i) high accuracy, (ii) high connectivity and efficiency and (iii) low training complexity.

Performance of ReHoGCNES model compared with the state-of-the-art network-based methods
In order to further prove the superiority of the proposed ReHoGCNES method, we compare it with five state-of-the-art network-based methods published after 2022, including LAGCN [78], DEJKMDR [79], NSAMDA [80], HGANMDA [81] and HLGNN-MDA [82], under the same experimental conditions. The aforementioned methods apply a variety of graph construction strategies: (i) LAGCN first integrates three associations into a heterogeneous network and applies GCN with attention mechanisms to learn the embeddings of miRNAs and diseases. (ii) DEJKMDR is a HeGCN model which randomly deletes edges to increase the diversity of data and reduce overfitting. (iii) NSAMDA identifies MDAs based on neighbor selection graph attention networks. (iv) HGANMDA constructs a miRNA-disease-lncRNA heterogeneous graph and applies node-layer attention to learn the importance of neighbor nodes based on different meta-paths. (v) HLGNN-MDA proposes a heuristic learning network enabling it to learn information among homogenous and heterogeneous nodes. We verified the superiority of the ReHoGCNES method on the four datasets. Table 7 shows that ReHoGCNES is better than the five state-of-the-art methods on all four datasets, especially on the completely new dataset Tn, which shows that ReHoGCNES has good predictive ability for new data.
5 and Table S10 showcase that our proposed model is superior to other traditional ML and DL models. The AUCs of our method are higher than those of SVM, GBDT, RF and DNN by 70.64, 7.12, 7.79 and 7.89%, respectively. Table 9 shows two-tailed P-values of paired t-tests between the AUCs of ReHoGCNES and those of the machine learning and deep learning models on the four tasks, indicating the statistical significance of the improvements. The excellent performance of ReHoGCNES is attributed to its effective and powerful graph processing ability via adaptive neighbor feature aggregation. Compared with DNN models, ReHoGCNES proves more suitable for handling graph-structured data such as MDA predictions. Although tree-based methods like GBDT and RF outperformed DNN, it is worth noting that the predictive performance of the DNN model might be improved by increasing the number of hidden layers, since we only fine-tuned the parameters with the default number of network layers. Both ML and DL models failed to predict new MDPs. GCN-based methods require both newly discovered miRNAs and newly discovered diseases to be included in the graph construction process, which is impossible to achieve in the real world. Therefore, although ReHoGCNES achieved better performance on the Tn dataset, it still has limitations for associations between new miRNAs and new diseases.

Case studies
We employed the devised method ReHoGCNES to predict new MDAs for three prevalent human diseases (breast neoplasms, prostate neoplasms and pancreatic neoplasms), leveraging the known associations from HMDD. Our method was executed to ascertain the prediction scores of candidate miRNAs in relation to these neoplasms. The scores of the candidate associations were ranked, and the top 30 candidate associations with these diseases were selected. Subsequently, the prediction results were validated by two databases: dbDEMC V3.0 [83] and PhenomiR [84]. The results from the three case studies are detailed in Tables 10-12. Consequently, 28 of the top 30 miRNAs were confirmed to be associated with breast neoplasms, 27 of the top 30 miRNAs with prostate neoplasms, and 28 of the top 30 miRNAs with pancreatic neoplasms. These findings confirm that our method is capable of effectively predicting potential MDAs.

DISCUSSION
In this paper, we proposed a novel GCN-based graph-building strategy (ReHoGCNES), based on a regular graph with a random ES, to predict MDAs. The experimental results on four datasets demonstrate that the proposed ReHoGCNES-MDA method achieves excellent results, which implicitly reveals that a steady degree distribution of a graph, such as a uniform distribution, plays an important role in enhancing prediction performance. The robust performance of the proposed ReHoGCNES method can be attributed to several crucial factors. First, we integrated beneficial similarity features to build a homogenous network, thereby maximizing the utility of available information through the aggregation of neighborhood data. Second, we employed an inventive graph construction technique by utilizing regular graphs for the GCN model. A range of experiments illustrated that regular graphs offer advantages in terms of graph connectivity, edge connectivity and subgraphs, which translate into advantages in accuracy, scalability and training complexity. The connectivity of a graph is an important indicator of its network resilience; higher connectivity allows the features of a central node to propagate to its spatial neighbors more effectively, yielding better model performance. In addition, the eigenvalues of the GCN's Laplacian matrix are closely related to the connectivity of the graph, and the second-smallest eigenvalue of the matrix is called the algebraic connectivity of the graph (Supplementary Proofs A). As far as we know, this is the first study to explore and compare graph structure construction for MDA prediction. Additionally, this method holds promise for predicting related miRNAs of diseases even in the absence of known associations.
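The relationship between regularity and algebraic connectivity mentioned above can be made concrete with standard spectral graph theory. The notation below (A for the adjacency matrix, D for the degree matrix, L for the Laplacian) is assumed for illustration and may differ from that used in Supplementary Proofs A.

```latex
% Assumed notation: A = adjacency matrix, D = diagonal degree matrix, n = number of nodes.
L = D - A, \qquad 0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n .
% \lambda_2 is the algebraic connectivity: the graph is connected iff \lambda_2 > 0.

% For a k-regular graph, D = kI, so the combinatorial and the symmetric
% normalized Laplacian differ only by the constant factor 1/k:
L = kI - A, \qquad
L_{\mathrm{sym}} = I - D^{-1/2} A D^{-1/2} = I - \tfrac{1}{k} A = \tfrac{1}{k} L .

% Worked example: the cycle on n = 6 nodes is 2-regular, with Laplacian eigenvalues
% 2 - 2\cos(2\pi j / 6), j = 0, \dots, 5, hence
\lambda_2 = 2 - 2\cos\!\left(\tfrac{2\pi}{6}\right) = 1 .
```

In a regular graph every node is therefore normalized by the same degree, so the spectrum seen by the GCN propagation rule is a rescaled copy of the combinatorial Laplacian spectrum.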
In sum, ReHoGCNES-MDA highlights the significance of regular graphs and effectively predicts potential MDAs. Nevertheless, our method has certain limitations. The datasets used for network construction may contain noise and outliers. Additionally, the performance of the ReHoGCNES model warrants further validation with a larger number of samples. Hence, our future research will be directed toward model validation using more refined data.

Key Points
• The detection of miRNA-disease linkages via computational techniques utilizing biological information has emerged as a cost-effective and highly efficient approach.

Figure 1. Basic structures of homogenous graph (A) and heterogeneous graph (B) used in MDA prediction task.

Figure 2 depicts the workflow of the proposed method. First, IDSM and IMSM are calculated as integrated node features of miRNA and disease. Next, three graph architectures of the GCN model are proposed: (a) homogenous GCN with regular graph structure built through the k-NN algorithm (ReHoGCN model); (b) homogenous GCN with non-regular graph structure built through the k-means algorithm (UReHoGCN model); (c) traditional HeGCN (HeGCN model). IDSM and IMSM are concatenated as node features for the ReHoGCN and UReHoGCN models, while IDSM and IMSM serve in parallel as features of the disease-disease similarity network and the miRNA-miRNA similarity network for the HeGCN model. Then, an ES was used in the GCN training process. Comprehensive data regarding the hyperparameters and architectures of the

Figure 2. The workflow of the proposed method, where MDP represents the miRNA-disease pair and GCN represents graph convolutional network. Three graph architectures of the GCN model were proposed in our work: (a) homogenous GCN with regular graph structure built through the k-NN algorithm (ReHoGCN model); (b) homogenous GCN with non-regular graph structure built through the k-means algorithm (UReHoGCN model); (c) traditional HeGCN (HeGCN model). IDSM and IMSM are concatenated as node features for the ReHoGCN and UReHoGCN models, while IDSM and IMSM serve in parallel as features of the disease-disease similarity network and the miRNA-miRNA similarity network for the HeGCN model. Then, an ES was used in the GCN training process.
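Architecture (a) and the ES step of this workflow can be sketched as follows. This is a rough sketch under stated assumptions, not the published implementation: "regular" is read here as every node having the same out-degree k in a k-NN graph, the sampler keeps a uniformly random fraction of edges per epoch, and the names `knn_edges` and `sample_edges`, the toy similarity matrix and the ratio 0.5 are all hypothetical.

```python
import random


def knn_edges(sim, k):
    """Connect every node to its k most similar other nodes (directed edges)."""
    edges = []
    for i, row in enumerate(sim):
        # Rank the other nodes by similarity to node i, highest first;
        # ties are broken by node index so the result is deterministic.
        others = sorted((j for j in range(len(row)) if j != i),
                        key=lambda j: (-row[j], j))
        edges.extend((i, j) for j in others[:k])
    return edges


def sample_edges(edges, ratio, rng):
    """Keep a uniformly random subset of edges for one training epoch."""
    n_keep = max(1, int(len(edges) * ratio))
    return rng.sample(edges, n_keep)


# Toy symmetric similarity matrix for 5 nodes (made-up values).
sim = [
    [1.0, 0.9, 0.2, 0.4, 0.1],
    [0.9, 1.0, 0.3, 0.5, 0.2],
    [0.2, 0.3, 1.0, 0.8, 0.6],
    [0.4, 0.5, 0.8, 1.0, 0.7],
    [0.1, 0.2, 0.6, 0.7, 1.0],
]
edges = knn_edges(sim, k=2)
out_degree = [sum(1 for s, _ in edges if s == i) for i in range(len(sim))]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
epoch_edges = sample_edges(edges, ratio=0.5, rng=rng)
print(out_degree, len(edges), len(epoch_edges))
```

Every node ends up with exactly k outgoing edges, and each epoch touches only the sampled subset, which is the intuition behind a sampler reducing training time and MAC.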

Figure 4. Analysis of the influence of random ES from two aspects. (a) Total training time of three graph construction methods with/without ES on four tasks; (b) MAC of three graph construction methods with/without ES on four tasks.

Figure 5. Results of ReHoGCNES performance compared with classic machine learning and deep learning models on four test sets ((a) Tp, (b) Tm, (c) Td, (d) Tn) obtained from HMDD.

Table 1: Summary of different graph structures and prediction types

Table 2: Number of entries of the training set and four different test sets obtained from HMDD v2.0 and HMDD v3.0 for miRNAs and diseases, respectively, which also underline the satisfactory performance and validate the effectiveness of the proposed ReHoGCNES-MDA method. In sum, ReHoGCNES-MDA underscores the significance of a regular graph and can proficiently predict potential MDAs.

Table 3: Detailed degree statistics of the three proposed models

Table 4: Experimental environment

u,v and λ v are defined as follows:

Table 5: Prediction performance of our proposed method ReHoGCN compared with UReHoGCN and HeGCN without ES on four tasks

Table 6: Two-tailed P-values of paired t-test for AUC on four datasets

sets contain new miRNAs or new diseases or both. ReHoGCN models also achieve good results on the completely new dataset Tn, which shows that they have good predictive ability for new data. The robustness and scalability of the model are validated.

Table 8 shows the P-values of the paired t-test for AUC between the ReHoGCNES model and other state-of-the-art models on four datasets. All P-values are far <0.01, meaning that the probability that the differences among models are due to sampling error is <0.01 and model improvement by ReHoGCNES is

Table 8: Two-tailed P-values of paired t-test between the AUC of the ReHoGCNES and other state-of-the-art models on four tasks

Table 9: Two-tailed P-values of paired t-test between the AUC of the ReHoGCNES and machine learning and deep learning models on four tasks

Table 13: Comparison between ReHoGCNES and five state-of-the-art models regarding predicted miRNAs associated with three kinds of neoplasms. The n column gives the number of the top 30 new MDAs that are confirmed. The overlap ratio represents the repetition rate between associations found by other methods and by the proposed ReHoGCNES method.

• Three GCNs of different structures were tested on four experimental tasks on the MDA prediction problem for the first time. The ReHoGCN model (homogenous GCN with regular graph structure through the k-NN algorithm) achieved the best performance.
• The random ES implemented on the ReHoGCN model can significantly reduce the training time and MAC with no loss of prediction accuracy.
• The ReHoGCNES model, which integrates beneficial similarity features and aggregates neighborhood data, offers advantages in terms of accuracy and scalability.